It is argued that the present log-normal distribution of language sizes is, to a large 
extent, a consequence of demographic dynamics within the population of speakers of 
each language. A two-parameter stochastic multiplicative process is proposed as a model 
for the population dynamics of individual languages, and applied over a period spanning 
the last ten centuries. The model disregards language birth and death. A straightforward 
fitting of the two parameters, which statistically characterize the population growth 
rate, predicts a distribution of language sizes in excellent agreement with empirical data. 
Numerical simulations, and the study of the size distribution within language families, 
validate the assumptions at the basis of the model. 
1. Introduction 

Statistical aspects of the evolution of languages have attracted, in the last few years, 
a great deal of attention among physicists and mathematicians [1-11]. 
One of the better established quantitative empirical facts about extant languages is 
their size distribution, namely, the frequency of languages with a given number of 
speakers. Naturally, explaining the origin of this distribution is one of the main goals 
of mathematical modeling in this field. 

Recent work has built on variations of two basic models of language evolution, 
Schulze's model and Viviane's model, both of them mostly focused on the effects 
of mutation of linguistic features which give rise to new languages, and including the 
possibility of language extinction. These models, however, disregard the fact that, 
over periods which are short compared with the typical time scales of language 
evolution, the speakers of a given language can substantially vary in number just by 
the effect of population dynamics. For instance, in the last five centuries, a period 
which includes the culturally devastating European invasion of the rest of the globe, 
perhaps 50% of the world's languages went extinct (among them, two thirds of the 
2,000 preexisting native American languages) or changed drastically. In the same 
period, however, the world's population grew by a factor of twelve or more. 

Demographic effects have very recently been incorporated into a model of language 
evolution by Tuncay, in the form of a stochastic multiplicative model 
for population growth (see also Ref. [?]). With suitable tuning of its several 
parameters, Tuncay's model is able to reproduce reasonably well the observed distribution 
of language sizes, as a result of numerical simulations. The complexity of the 
model, which in its full form includes population growth along with language 
inheritance, branching, assimilation, and extinction, makes it difficult, however, to 
identify the specific mechanism that shapes the distribution of sizes. 

In this paper, I show that the empirical distribution of language sizes can be ac- 
curately explained taking into account just the effect of demographic processes. 
This possibility was already pointed out, for the specific case of the languages 
of New Guinea, by Novotny and Drozd. I propose a two-parameter stochastic 
model where the populations speaking different languages evolve independently of 
each other. During the evolution -which, in the realization of the present model, 
is assumed to span 1,000 years- language creation, assimilation, and extinction are 
disregarded. This assumption does not discard mutations inside a given language, 
which may lead to its internal evolution, but each language preserves its identity 
as a cultural and demographic unit along the whole period. The model is analyti- 
cally tractable, and its two parameters can be fitted a priori from empirical data. 
Numerical simulations confirm the prediction that the distribution is essentially in- 
dependent of details in the initial condition -the distribution of sizes ten centuries 
ago- so that, in a sense, the present distribution is the unavoidable consequence 
of just demographic growth. The model is further validated by showing that the 
distribution of sizes of languages belonging to a given family has the same shape as 
the overall distribution. These results strongly suggest that population dynamics is 
a necessary ingredient in models of linguistic evolution. 



2. Present distribution of language sizes 

It is a well established empirical fact that the frequency of languages with 
a given number of speakers $p$ is satisfactorily approximated by a log-normal 
distribution. Accordingly, the distribution of language sizes as a function of 
the log-size, $q = \ln p$, is approximately given by a Gaussian, 

$$Q(q) = (2\pi\theta^2)^{-1/2} \exp\left[ -(q - \bar{q})^2 / 2\theta^2 \right], \tag{1}$$

where $\bar{q}$ and $\theta$ are, respectively, the mean value and the mean square dispersion of 
$q$. The quantity $Q(q)\,dq$ gives the fraction of languages with log-sizes in $(q, q + dq)$. 
Significant departures from the log-normal distribution are limited to small language 
sizes, up to the order of a few tens of speakers. 

Fig. 1. The histogram shows the number of languages whose sizes $p$ have logarithms $q = \ln p$ 
in the corresponding bins, according to the Ethnologue statistical summaries. The curve is the 
Gaussian distribution $Q(q)$ of Eq. (1) with the parameters of Eq. (4). For comparison with the 
histogram, $Q(q)$ is multiplied by the total number of languages, $L = 6{,}604$, and by the bin width 
$\Delta q = \ln 10$. Note the deviation from normality in the leftmost column of the histogram. 

Ethnologue statistical summaries, whose data were collected 
mostly during the 1990s, list the number of languages with sizes within decade bins 
(1 to 9 speakers, 10 to 99 speakers, 100 to 999 speakers, and so on), and give the 
number of speakers within each bin. The total number of languages in the list is 
$L = 6{,}604$, accounting for an overall population a little above $5.7 \times 10^9$ speakers. 
The sizes of 308 languages of the database are unknown. In terms of the distribution 
$Q(q)$, the number of languages in the bin between $10^k$ and $10^{k+1}$ speakers is 

$$L_k = L \int_{k \ln 10}^{(k+1) \ln 10} Q(q)\, dq, \tag{2}$$

while the total population speaking the languages in the same bin is 

$$P_k = L \int_{k \ln 10}^{(k+1) \ln 10} \exp(q)\, Q(q)\, dq. \tag{3}$$

The values of $L_k$ and $P_k$ provided by the Ethnologue statistical summaries can thus 
be used to estimate the parameters $\bar{q}$ and $\theta$ in the distribution $Q(q)$ of Eq. (1). 
Least-square fitting yields 

$$\bar{q} = 8.97 \pm 0.15, \qquad \theta = 3.04 \pm 0.09. \tag{4}$$
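
The bin integrals of Eqs. (2) and (3) have closed forms for a Gaussian, so the fitted curve can be turned back into expected counts per decade bin. The Python sketch below does this with the fitted values of Eq. (4); the function and variable names are illustrative and not part of the original analysis, and the expression used for Eq. (3) relies on the standard completing-the-square identity noted in the comment.

```python
import math

L_TOTAL = 6604   # number of languages, from the text
Q_MEAN = 8.97    # fitted mean log-size, Eq. (4)
THETA = 3.04     # fitted dispersion of the log-size, Eq. (4)
LN10 = math.log(10.0)


def normal_cdf(x):
    """Cumulative distribution function of the standard normal."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def languages_in_bin(k):
    """Eq. (2): expected number of languages with 10**k <= p < 10**(k + 1)."""
    lo = (k * LN10 - Q_MEAN) / THETA
    hi = ((k + 1) * LN10 - Q_MEAN) / THETA
    return L_TOTAL * (normal_cdf(hi) - normal_cdf(lo))


def speakers_in_bin(k):
    """Eq. (3): expected total population in the same bin.

    exp(q) * Q(q) equals exp(Q_MEAN + THETA**2 / 2) times a Gaussian with
    mean Q_MEAN + THETA**2 and the same dispersion (completing the square).
    """
    shifted_mean = Q_MEAN + THETA ** 2
    lo = (k * LN10 - shifted_mean) / THETA
    hi = ((k + 1) * LN10 - shifted_mean) / THETA
    prefactor = L_TOTAL * math.exp(Q_MEAN + 0.5 * THETA ** 2)
    return prefactor * (normal_cdf(hi) - normal_cdf(lo))


# Ten decade bins, from 1-9 speakers up to 10**9 - 10**10 speakers.
bin_counts = [languages_in_bin(k) for k in range(10)]
bin_speakers = [speakers_in_bin(k) for k in range(10)]

for k in range(10):
    print(f"10^{k} to 10^{k + 1}: {bin_counts[k]:8.1f} languages, "
          f"{bin_speakers[k]:.3e} speakers")
```

Comparing these values with the tabulated bin counts is, in essence, what the least-square fitting minimizes; the sketch only evaluates the model side of that comparison and does not reproduce the fit itself.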

Figure 1 shows, as a histogram over the variable $q$, Ethnologue's data for $L_k$. 
The curve is a plot of the function $L\,Q(q)$ with the parameters of Eq. (4), whose 
integral over the histogram bins approximates the empirical values of $L_k$. For easier 
comparison with the histogram, $L\,Q(q)$ is further multiplied by the bin width 
$\Delta q = \ln 10$. The aim in the following is to provide a model which explains the fitted 
distribution $Q(q)$. 

3. Demographic evolution of language sizes 

The log-normal shape of the distribution of language sizes suggests that a multi- 
plicative stochastic process is at work in the evolution of the number of speakers of 
each language. This is in turn consistent with the hypothesis that, over sufficiently 
long time scales, the population speaking a given language evolves autonomously, 
driven just by demographic processes. 

Let $p_t^{(j)}$ be the number of speakers of language $j$ at time $t$, and assume that at 
time $t + 1$ (one year later, say) the population has changed to 

$$p_{t+1}^{(j)} = \alpha_t^{(j)} p_t^{(j)}, \tag{5}$$

where the growth rate $\alpha_t^{(j)}$ is a positive stochastic variable drawn from some 
specified distribution. As shown below, the mean value and the mean square dispersion 
over this distribution can be estimated from empirical data. In terms of the initial 
population $p_0^{(j)}$, the number of speakers at time $t$ is 

$$p_t^{(j)} = p_0^{(j)} \prod_{s=0}^{t-1} \alpha_s^{(j)}. \tag{6}$$

I suppose now that, during the whole $t$-step process, the distribution of the growth 
rate $\alpha_t^{(j)}$ (i) is the same for all languages, and (ii) does not vary with time. Moreover, 
(iii) no language is created or becomes extinct. Admittedly, these are rather bold 
assumptions for the world's history during the last 1,000 years. However, in view of 
the lack of reliable data over such a period, they are at least justified for the sake of 
simplicity. 

I identify the evolution of the world's languages as $L = 6{,}604$ realizations of the 
multiplicative stochastic process (5). By virtue of assumption (i), all the realizations 
are statistically equivalent. In this interpretation, the present distribution of 
log-sizes, given by Eqs. (1) and (4), is the probability distribution for the variables 
$\ln p_t^{(j)}$ obtained from those realizations. Thus, my aim is to quantitatively relate the 
distribution $Q(q)$ to the outcome of the stochastic process. 

The total population $P_t$ at time $t$ is 

$$P_t = \sum_{j=1}^{L} p_t^{(j)} = \sum_{j=1}^{L} p_0^{(j)} \prod_{s=0}^{t-1} \alpha_s^{(j)}. \tag{7}$$

Averaging this expression over realizations of the stochastic variable $\alpha_t^{(j)}$, and 
assuming that the growth rate is not self-correlated in time, we find $\langle P_t \rangle = \langle \alpha \rangle^t P_0$, 
where $\langle \alpha \rangle$ is the mean growth rate and $P_0$ is the initial total population. In order 
to apply this analysis to the world's population in the last ten centuries, let us take 
$P_0 = 3.1 \times 10^8$, which is the estimated population by the year 1000. Ascribing 
the total population accounted for by Ethnologue's data to the year 2000, and 
associating this number with the population averaged over realizations of the growth 
rate, we have $\langle P_t \rangle = 5.7 \times 10^9$ and $t = 10^3$. This makes it possible to evaluate the 
mean growth rate per year as 

$$\langle \alpha \rangle = \left( \langle P_t \rangle / P_0 \right)^{1/t} \approx 1.0029. \tag{8}$$
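
The arithmetic behind Eq. (8) can be checked in a few lines. The following Python sketch uses only the two population figures and the time span quoted in the text; the variable names are illustrative.

```python
# Mean yearly growth rate from the initial and final total populations, Eq. (8).
P_INITIAL = 3.1e8   # estimated world population by the year 1000 (from the text)
P_FINAL = 5.7e9     # population accounted for by Ethnologue's data (from the text)
T_YEARS = 1000      # number of yearly steps

mean_alpha = (P_FINAL / P_INITIAL) ** (1.0 / T_YEARS)
print(f"mean growth rate per year: {mean_alpha:.4f}")
```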

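To evaluate the dispersion of the growth rate, it is useful to introduce its relative 
deviation with respect to the average, $\delta_t^{(j)}$, as 

$$\alpha_t^{(j)} = \langle \alpha \rangle \left( 1 + \delta_t^{(j)} \right). \tag{9}$$

The average value and the mean square dispersion of the deviation $\delta_t^{(j)}$ are, 
respectively, 

$$\langle \delta_t^{(j)} \rangle = 0, \qquad \sigma_\delta \equiv \langle (\delta_t^{(j)})^2 \rangle^{1/2} = \sigma_\alpha / \langle \alpha \rangle, \tag{10}$$

where $\sigma_\alpha$ is the mean square dispersion of the growth rate. Assuming that $\delta_t^{(j)}$ is 
always sufficiently small to approximate $\ln(1 + \delta_t^{(j)}) \approx \delta_t^{(j)} - (\delta_t^{(j)})^2 / 2$, Eq. (6) can be 
rewritten for the log-size $q_t^{(j)} = \ln p_t^{(j)}$ as 

$$q_t^{(j)} = q_0^{(j)} + t \ln \langle \alpha \rangle + \sum_{s=0}^{t-1} \left[ \delta_s^{(j)} - (\delta_s^{(j)})^2 / 2 \right]. \tag{11}$$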

This equation shows explicitly that, besides the deterministic growth given by the 
term $t \ln \langle \alpha \rangle$, the evolution of the logarithm of the population speaking a given 
language is driven by an additive stochastic process. Thus, by virtue of the central 
limit theorem, the distribution $Q(q)$ must converge to a Gaussian as in Eq. (1), 
starting from any distribution of initial log-sizes $q_0^{(j)}$. For the time being, however, 
the question remains whether the times relevant to the process are long enough to allow 
for the development of the Gaussian shape and, in particular, to suppress any effect 
ascribable to a specific initial distribution. 

Unfortunately, the initial sizes $p_0^{(j)}$ (the number of speakers of each language 
1,000 years ago) are not known for most languages. However, their effect on the 
present size distribution can be readily assessed. Averaging Eq. (11) over realizations 
of the stochastic process and over the distribution of initial log-sizes, and always 
assuming that the deviations $\delta_t^{(j)}$ are small, yields, for the mean square dispersion 
of $q_t^{(j)}$, 

$$\sigma_q^2 = \sigma_0^2 + \sigma_\delta^2 t, \tag{12}$$

where $\sigma_0$ is the mean square dispersion of the initial log-sizes. The empirical 
estimation for $\sigma_q$ is the value of $\theta$ given in Eq. (4). In turn, an upper bound can 
be given for $\sigma_0$, namely the maximal mean square dispersion in the log-size 
distribution of $L = 6{,}604$ languages with a total population of $P_0 = 3.1 \times 10^8$ speakers, 
and at least one speaker per language. This maximal dispersion is obtained with 
$L - 1$ languages with exactly one speaker, and just one language with the remaining 
$P_0 - L + 1$ speakers. Clearly, this is an unlikely distribution for the languages 1,000 
years ago, but represents the worst-case instance, with the largest contribution of 
the initial condition to the present dispersion of log-sizes. In this extreme situation, 
the estimation for the initial mean square dispersion is $\sigma_0^2 \approx L^{-1} \ln^2 P_0 \approx 0.058$. 
Meanwhile, according to Eq. (4), $\sigma_q^2 \approx 9.2$. In the right-hand side of Eq. (12), 
therefore, the largest contribution by far is given by the second term, and that of the 
initial distribution is essentially negligible. This makes it possible to calculate the 
mean square dispersion of the deviations $\delta_t^{(j)}$ as 

$$\sigma_\delta \approx t^{-1/2} \theta \approx 0.096 \tag{13}$$

for $t = 10^3$, thus completing the statistical characterization of the growth rate per 
year, $\alpha_t^{(j)}$. Note that this value of $\sigma_\delta$ is in agreement with the assumption of small 
relative deviations in the growth rate. 
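
The two estimates above are easy to reproduce. The Python sketch below evaluates the worst-case initial dispersion both from the approximate expression $L^{-1} \ln^2 P_0$ and directly from the extreme configuration, and then obtains $\sigma_\delta$ from Eq. (12) with and without the initial term. The input values are those quoted in the text; the variable names are illustrative.

```python
import math

L = 6604        # number of languages
P0 = 3.1e8      # initial total population
THETA = 3.04    # present dispersion of log-sizes, Eq. (4)
T = 1000        # number of yearly steps

# Worst case: L - 1 languages with one speaker (q = 0) and a single
# language holding the remaining P0 - L + 1 speakers.
q_big = math.log(P0 - L + 1)
mean_q0 = q_big / L
sigma0_sq_exact = q_big ** 2 / L - mean_q0 ** 2
sigma0_sq_approx = math.log(P0) ** 2 / L

# Eq. (12) solved for sigma_delta, keeping and then neglecting sigma_0.
sigma_delta_full = math.sqrt((THETA ** 2 - sigma0_sq_exact) / T)
sigma_delta = THETA / math.sqrt(T)   # Eq. (13)

print(f"sigma_0^2 (exact worst case): {sigma0_sq_exact:.4f}")
print(f"sigma_0^2 (approximation):    {sigma0_sq_approx:.4f}")
print(f"sigma_delta with sigma_0:     {sigma_delta_full:.4f}")
print(f"sigma_delta without sigma_0:  {sigma_delta:.4f}")
```

The two values of $\sigma_\delta$ differ only beyond the second significant figure, which is the sense in which the initial condition is negligible in Eq. (12).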

Summarizing the results of this section, I have argued that the present 
log-normal distribution of language sizes can be seen as the natural consequence of 
population dynamics driven by a stochastic multiplicative process, Eq. (5), where 
the evolution of each language is interpreted as a realization of the process. Using 
data on the total population growth during the last 1,000 years (a period over 
which I neglect language birth and death) and fitting only one parameter ($\sigma_\delta$) from 
the distribution itself, I was able to statistically characterize the growth rate per 
year which explains the present distribution, giving its mean value and mean square 
dispersion. Also, I have argued that the dispersion of language sizes ten centuries 
ago has essentially no effect on its present value. It is now useful to validate these 
conclusions with numerical realizations of the model, and with applications within 
language families. 

4. Validation of the model 
4.1. Numerical results 

In this section, I present results for series of $L = 6{,}604$ numerical realizations of 
the multiplicative stochastic process (5). The mean value and the mean square 
dispersion of the growth rate $\alpha_t^{(j)}$ are fixed according to the values estimated in 
Section 3, Eqs. (8) and (13). In order to speed up the computation, individual values 
of $\alpha_t^{(j)}$ are drawn from a square (uniform) distribution centered at $\langle \alpha \rangle$, with a width which 
ensures the correct mean square dispersion. In agreement with my main assumptions, 
I avoid the possibility that languages die out by replacing the absorbing boundary 
at $p = 1$, below which a language should become extinct, by a reflecting boundary. 
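
A minimal Python sketch of such a simulation is given below. It is not the original code: the function name, the seed handling, and the reduced number of languages in the demonstration are illustrative choices, and the reflecting boundary is implemented in log space as $q \to -q$, which is one possible reading of the description above. The width of the uniform distribution follows from the fact that a uniform variable of full width $w$ has variance $w^2 / 12$.

```python
import math
import random
import statistics

MEAN_ALPHA = 1.0029    # mean yearly growth rate, Eq. (8)
SIGMA_DELTA = 0.096    # relative dispersion of the growth rate, Eq. (13)


def simulate_log_sizes(initial_sizes, n_steps=1000, seed=0):
    """Return the final log-sizes after n_steps of p -> alpha * p.

    alpha is uniform around MEAN_ALPHA with standard deviation
    SIGMA_DELTA * MEAN_ALPHA; sizes are reflected at p = 1 (q = 0).
    """
    rng = random.Random(seed)
    # Uniform distribution of half-width h has variance h**2 / 3.
    half_width = math.sqrt(3.0) * SIGMA_DELTA * MEAN_ALPHA
    low, high = MEAN_ALPHA - half_width, MEAN_ALPHA + half_width
    final_log_sizes = []
    for p0 in initial_sizes:
        q = math.log(p0)
        for _ in range(n_steps):
            q += math.log(rng.uniform(low, high))
            if q < 0.0:
                q = -q   # assumed form of the reflecting boundary
        final_log_sizes.append(q)
    return final_log_sizes


# Demonstration with equal initial sizes P0 / L, using fewer languages
# than L = 6,604 to keep the pure-Python run short.
n_languages = 500
p0 = 3.1e8 / 6604
final_q = simulate_log_sizes([p0] * n_languages)
print(f"mean log-size: {statistics.mean(final_q):.2f}")
print(f"dispersion:    {statistics.pstdev(final_q):.2f}")
```

The printed mean and dispersion can be compared with the fitted values of Eq. (4); with only a few hundred realizations, sampling fluctuations of a few percent are to be expected.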

In view of the discussion in the previous section, the convergence of the 
distribution of log-sizes to a Gaussian is guaranteed by the central limit theorem. The 
emphasis in the simulations is thus put on the possible effects of the distribution 
of initial sizes $p_0^{(j)}$. Figure 2 shows, as normalized histograms, numerical results for 
single series of $L$ realizations of the stochastic process after $t = 10^3$ steps, from four 
different initial conditions. In (a), all the languages have exactly the same initial 
size $p_0$. In (b), the initial sizes are uniformly distributed between p = and $p_{\max}$. In 
(c), the distribution of initial sizes is also uniform, but spans the interval $(p_*, 2p_*)$. 
Finally, in (d) the initial distribution is more heterogeneous, with half the languages 
having size $p_\dagger$ and the other half having size $10 p_\dagger$. The parameters $p_0$, $p_{\max}$, $p_*$, and 
$p_\dagger$, which characterize these initial distributions, are fixed by the condition that the 
total population is $P_0 = 3.1 \times 10^8$. The curve in all plots is the Gaussian of Eq. (1) 
with the parameters of Eq. (4). The agreement in cases (a) to (c) is excellent. Only 
in case (d), where the initial distribution is especially heterogeneous (and, certainly, 
not a likely representation of the distribution of language sizes ten centuries ago), 
are the deviations larger, though the agreement is still very reasonable. 
Fig. 2. Normalized histograms for $L = 6{,}604$ language sizes after $10^3$ steps of the stochastic 
process (5), starting from the four initial conditions described in the text. Curves stand for the 
expected Gaussian distribution, Eq. (1), with the parameters of Eq. (4). 

The difference between the numerical results of case (d) and the expected 
Gaussian function resides not only in the width of the distribution but also in its mean 
value. To a much lesser extent, this discrepancy is also visible in case (b). This shift 
between the distribution peaks can be understood in terms of the average of Eq. 
(11) over both the realizations of the growth rate and the initial log-sizes, 

$$\langle q_t^{(j)} \rangle = \langle q_0^{(j)} \rangle + t \left( \ln \langle \alpha \rangle - \sigma_\delta^2 / 2 \right). \tag{14}$$

Besides the contribution of the multiplicative stochastic process, given by the term 
proportional to the time $t$, the mean value in Eq. (14) depends on the average 
initial log-size $\langle q_0^{(j)} \rangle$. In spite of the fact that the total initial population and the 
number of languages are the same for all simulations, which always gives the same 
average size per language, the average log-size depends on the specific initial 
distribution. Thus, the final mean values for different initial conditions are generally 
shifted with respect to each other. 
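
The size of this shift can be estimated directly from Eq. (14). The Python sketch below evaluates the predicted final mean log-size for initial conditions (a) and (d), using the values of Eqs. (4), (8), and (13). The variable names are illustrative, and the value of the smaller initial size in case (d) is derived here from the total-population constraint rather than quoted from the text.

```python
import math

L = 6604
P0 = 3.1e8
T = 1000
mean_alpha = (5.7e9 / P0) ** (1.0 / T)   # Eq. (8)
sigma_delta = 3.04 / math.sqrt(T)        # Eq. (13)

# Contribution of the stochastic process to the mean log-size, Eq. (14).
drift = T * (math.log(mean_alpha) - sigma_delta ** 2 / 2.0)

# Case (a): every language starts with P0 / L speakers.
q0_a = math.log(P0 / L)

# Case (d): half the languages have size p_dag and half have 10 * p_dag,
# with p_dag fixed by the total population: (L / 2) * 11 * p_dag = P0.
p_dag = 2.0 * P0 / (11.0 * L)
q0_d = 0.5 * (math.log(p_dag) + math.log(10.0 * p_dag))

print(f"drift term over {T} steps:  {drift:.3f}")
print(f"case (a) final mean log-size: {q0_a + drift:.3f}")
print(f"case (d) final mean log-size: {q0_d + drift:.3f}")
```

Under these assumptions, case (a) ends close to the fitted mean of Eq. (4), whereas case (d) ends roughly half a unit of $q$ lower, in line with the leftward shift described above. Note also that the drift term is negative: the mean log-size decreases even though the total population grows.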

As a consistency test for the assumption that languages do not die out along the 
evolution period considered here, I have also run simulations taking into account 
the absorbing boundary at $p = 1$. Namely, when the size of a language decreases 
below one speaker during its evolution, it is considered to become extinct and that 
particular realization of the stochastic process is interrupted. Among the four initial 
conditions considered above, those which undergo the most extinctions are, not 
unexpectedly, (b) and (d), as they produce the largest shifts to the left in the log-size 
distributions. In both cases, however, the fraction of extinct languages is around 
1%, which validates the above assumption quite satisfactorily. 
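
The extinction test can be sketched in the same way, by counting the realizations that fall below one speaker at any step. The Python code below is an illustrative reimplementation for initial condition (d), not the original program; the function name, the sample size, and the value of the smaller initial size (derived from the total-population constraint) are assumptions.

```python
import math
import random


def extinct_fraction(initial_sizes, n_steps=1000, mean_alpha=1.0029,
                     sigma_delta=0.096, seed=0):
    """Fraction of realizations absorbed at p = 1 within n_steps."""
    rng = random.Random(seed)
    # Uniform growth rate with standard deviation sigma_delta * mean_alpha.
    half_width = math.sqrt(3.0) * sigma_delta * mean_alpha
    low, high = mean_alpha - half_width, mean_alpha + half_width
    extinct = 0
    for p0 in initial_sizes:
        q = math.log(p0)
        for _ in range(n_steps):
            q += math.log(rng.uniform(low, high))
            if q < 0.0:       # fewer than one speaker: the language is extinct
                extinct += 1
                break
    return extinct / len(initial_sizes)


# Initial condition (d): half the languages at p_dag, half at 10 * p_dag,
# with p_dag chosen so that L = 6,604 such languages total P0 = 3.1e8.
p_dag = 2.0 * 3.1e8 / (11.0 * 6604)
n_languages = 400             # reduced sample to keep the run short
sizes_d = [p_dag] * (n_languages // 2) + [10.0 * p_dag] * (n_languages // 2)

print(f"extinct fraction, case (d): {extinct_fraction(sizes_d):.3f}")
```

With a sample of a few hundred languages the estimate is noisy, since only a handful of realizations are expected to reach the boundary; a run with the full $L = 6{,}604$ languages gives a more stable figure.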



4.2. Size distribution within language families 

A crucial consequence of the hypotheses on which the present model is based -in 
particular, the mutual independence of the size evolution of different languages- 
is that its predictions should hold not only for the ensemble of all the world's 
languages, but also for any sub-ensemble to which the homogeneity assumptions 
(i) and (ii) reasonably apply. In other words, the final log-normal shape of the size 
distribution should also result from the evolution of, say, the languages of a given 
geographical region, or belonging to a given language family. This can be readily 
assessed from empirical data on the number of speakers of individual languages 
and, in fact, has already been pointed out for a set of some 1,000 New Guinean 
languages. 

Here, I analyze the size distribution of languages belonging to each one of the 
four largest families, according to Ethnologue's classification. Population data for 
individual languages were obtained from Ethnologue's online databases. Figure 3 
shows histograms of log-sizes for those families. To ease the comparison, the column 
width and the horizontal scale are the same as in Fig. 1. Curves stand for 
least-square fittings with Gaussian functions as in Eq. (1). The resulting parameters are 
$\bar{q} = 10.3$ and $\theta = 2.3$ for Niger-Congo; $\bar{q} = 8.2$ and $\theta = 2.4$ for Austronesian; $\bar{q} = 7.2$ 
and $\theta = 2.0$ for Trans-New Guinea; and $\bar{q} = 9.8$ and $\theta = 3.5$ for Indo-European. 
The quality of the agreement between the data and the Gaussian fitting is clearly 
comparable to that of the whole language ensemble in Fig. 1. 

Note the interesting fact that the mean square dispersion $\theta$ (which, according 
to the present model, results from the dispersion in the population growth rate) is 
noticeably larger for the Indo-European family than for the other three. Surely, this 
is a direct consequence of the highly diverse fate of European languages in the last 
few centuries. In any case, the four mean square dispersions are not far from the 
overall value given in Eq. (4). 



2, 2008 2:47 



WSPC/INSTRUCTION FILE zanetter 



Demographic growth and the distribution of language sizes 9 




[Figure 3: four histogram panels, (a) to (d); horizontal axis: log-size $q$, with ticks from 3 to 21.] 

Fig. 3. Histograms of the number of languages as a function of the language log-size for four 
language families: (a) Niger-Congo (1,495 languages), (b) Austronesian (1,246 languages), (c) 
Trans-New Guinea (561 languages), and (d) Indo-European (430 languages). Curves are Gaussian 
least-square fittings. 

5. Conclusion 

In this paper, I have argued that the present log-normal distribution of language 
sizes is essentially a consequence of demographic dynamics in the population of 
speakers of each language. In fact, an isolated population can largely vary in num- 
ber within time scales which are short as compared with those involved in substan- 
tial language evolution. To support this suggestion, I have proposed a stochastic 
multiplicative process for the population dynamics of individual languages, where 
language birth and death are disregarded. Under some bold assumptions on the 
geographical and temporal homogeneity of the process, the model is completely 
specified by two parameters, which give the average growth rate of the population 
and its mean square dispersion. The average growth rate is completely defined by 
the initial and the final world population. I have chosen to apply the model to the 
period spanning the last 1,000 years, for which reliable data on the world's 
population growth are available. The mean square dispersion of the growth rate is the only 
parameter which I fitted in an admittedly ad hoc manner, using the present 
distribution of language sizes. It seems unlikely that estimates of this parameter can be 
found from independent historical sources, as this would require reliable year-by-year 
records of population change. Note that the dispersion in the growth rate 
is determined by a variety of factors, including fluctuations in birth and mortality 
frequencies and migration events. 

Once the two parameters are fitted, the model is able to produce, as the result 
of evolution along ten centuries, excellent predictions of the present distribution 
of language sizes. Numerical simulations show that the final distribution is largely 
independent of the initial condition. This emphasizes the point that, irrespective 
of the long-range historical processes that may have determined the distribution of 
language sizes 1,000 years ago (including language birth and death, branching, 
mutation, competition, assimilation, and/or replacement), population dynamics is 
by itself able to explain the present distribution. This conclusion had already been 
advanced for the case of New Guinean languages by Novotny and Drozd. The fact that 
the same log-normal profile is found in the size distribution inside language families 
is a further validation of the present model. In view of the present arguments, one 
can moreover safely assert that the distribution of language sizes was already a 
log-normal function, with different parameters, in the year 1000. 

It is clear that in the last few years, with the advent of a host of new mechanisms 
of globalization which endanger cultural diversity, many, or most, of the world's 
languages are threatened by the risk of extinction. This risk is particularly 
acute for those languages whose number of speakers is below a few hundred, 
including the range where the distribution of sizes differs from the log-normal profile 
(cf. Fig. 1). It should be a program of obviously urgent interest to study in detail 
what the relevant processes at work in that range are, even if they escape the domain 
of the statistical physicists' approaches. 